Author: Dylan Lawless, PhD

Affiliation: Department of Intensive Care and Neonatology, University Children’s Hospital Zurich, University of Zurich.


h2030gc fastq file names:
<SAMPLE_ID>_<NGS_ID>_<POOL_ID>_<S#>_<LANE>_<R1|R2>.fastq.gz
and Illumina fastq header:
@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<sample number>
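
The two naming templates above can be split into named fields with a few lines of Python. This is a minimal sketch, not the facility's own tooling: the function names and the example values in the usage note are hypothetical, and it assumes that only the sample ID may contain underscores, so the file name is split from the right.

```python
from typing import Dict

FILENAME_KEYS = ("sample_id", "ngs_id", "pool_id", "s_number", "lane", "read")
HEADER_KEYS = (
    "instrument", "run_number", "flowcell_id", "lane", "tile", "x_pos", "y_pos",
    "read", "is_filtered", "control_number", "sample_number",
)


def parse_fastq_name(filename: str) -> Dict[str, str]:
    """Split <SAMPLE_ID>_<NGS_ID>_<POOL_ID>_<S#>_<LANE>_<R1|R2>.fastq.gz into fields."""
    suffix = ".fastq.gz"
    if not filename.endswith(suffix):
        raise ValueError(f"expected a name ending in {suffix}: {filename}")
    stem = filename[: -len(suffix)]
    # Assumption: the last five fields contain no underscores, so split from the right.
    parts = stem.rsplit("_", 5)
    if len(parts) != 6:
        raise ValueError(f"expected 6 underscore-separated fields: {filename}")
    fields = dict(zip(FILENAME_KEYS, parts))
    if fields["read"] not in ("R1", "R2"):
        raise ValueError(f"read field must be R1 or R2: {fields['read']}")
    return fields


def parse_illumina_header(header: str) -> Dict[str, str]:
    """Split an Illumina fastq header line into its colon-separated fields."""
    if not header.startswith("@"):
        raise ValueError("a fastq header line must start with '@'")
    # The header has two colon-separated groups divided by a single space.
    left, _, right = header[1:].strip().partition(" ")
    left_parts = left.split(":")
    right_parts = right.split(":")
    if len(left_parts) != 7 or len(right_parts) != 4:
        raise ValueError(f"unexpected header layout: {header}")
    return dict(zip(HEADER_KEYS, left_parts + right_parts))
```

For example, `parse_fastq_name("XYZ_STUDY_D.XYZ003_DNA_NGS01_POOL1_S1_L001_R1.fastq.gz")` keeps `XYZ_STUDY_D.XYZ003_DNA` intact as the sample ID; the `NGS01`, `POOL1`, `S1` and `L001` values are made up for illustration.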

Figure: VEP Consequences

1 Introduction

The two datasets (Sample ID: XYZ_STUDY_D.XYZ003_DNA and reference CH S2) of paired-end short reads were generated from the same human DNA NGS library protocol for clinical diagnosis of phenotype X. Clinical-grade sequencing (ISO 15189 accredited) was used to generate whole genome sequence (WGS) data at the Swiss Multi-Omic Center (SMOC). The Illumina NovaSeq 6000 platform was used in combination with TruSeq DNA PCR-Free library preparation. Analysis was performed with reference to GRCh38. Sensitive clinical data were handled according to established SPHN/BioMedIT guidelines on the sciCORE platform as part of SwissPedHealth. Herein, we evaluate sample performance for use in clinical diagnosis.

2 Results

2.1 Causal variant summary

  • Sample ID: XYZ_STUDY_D.XYZ003_DNA
  • HGVSc coding variant: ENST00000380588.4:c.53G>A
  • Protein variant: ENSP00000369962.4:p.Trp18Ter
  • ACMG score: 2
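
The coding and protein annotations above can be cross-checked with simple codon arithmetic. The sketch below is an illustration only and does not consult the transcript sequence: it assumes the standard genetic code, in which tryptophan is encoded solely by TGG, and the helper names are hypothetical.

```python
from typing import Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def codon_position(cdna_pos: int) -> Tuple[int, int]:
    """Return (codon number, position within codon), both 1-based, for a coding position."""
    if cdna_pos < 1:
        raise ValueError("coding positions start at 1")
    codon_number = (cdna_pos - 1) // 3 + 1
    offset = (cdna_pos - 1) % 3 + 1
    return codon_number, offset


def mutate_codon(codon: str, offset: int, ref: str, alt: str) -> str:
    """Substitute the base at a 1-based offset, checking the reference base first."""
    if codon[offset - 1] != ref:
        raise ValueError(f"reference base {ref} not found at offset {offset} of {codon}")
    return codon[: offset - 1] + alt + codon[offset:]


# c.53G>A: position 53 falls in codon 18, at the second base of the codon.
codon_number, offset = codon_position(53)

# Assumption: the reference codon is TGG, the only Trp codon in the standard code.
mutant = mutate_codon("TGG", offset, ref="G", alt="A")
is_stop = mutant in STOP_CODONS
```

Under these assumptions the G>A change at the second base turns TGG into TAG, a stop codon at residue 18, which agrees with p.Trp18Ter.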

Read about the ACMG classification scoring system and protocol here.

2.2 3D Structure Viewer

To do: forward PDB from AutoDestructR.

2.3 Causal variant evidence interpretation

The analysis of variant interpretation performed by ACMGuru is summarised from the following evidence sources:

Figure: AutoDestructR Single Case - This figure includes the use of gene structure and functional data including sources from UniProt, protein structure data from PDB and AlphaFold.

Figure: Protein Pathway Construction Whole Genome V1 Compared - This figure includes the use of protein pathway construction from STRING (GO, KEGG, Reactome, etc.).

Figure: Combined GO Plots - This includes biological protein pathway information from GO.

Figure: QQ Plot Data from Joint Cohort Analysis - Contains the QQ plot data from the joint cohort analysis of single variants.

Figure: Protein Pathway Network 22 - Contains the protein pathway identified as enriched in patients sharing the same biological mechanism as cause of disease.

3 Methods

This analysis consisted of germline short variant discovery (SNPs + Indels) using GATK v4.5.0 and interpretation using ACMGuru. Read the full public methods protocol page here.

4 Fastq Data

To assess the quality of fastq data, FastQC was used. Full HTML reports for each file are linked below:

The results of FastQC were also assessed using fastqcr. The full HTML report is linked here: Report assessment of FastQC.

4.1 Quality Assessment

  1. Total sequences or the number of reads for each sample: 1,000,000
  2. Per base sequence quality: All samples performed sufficiently.
    • Median value (red line): AH good quality (qual >28), CH good quality (qual >28).
    • Inter-quartile range (25-75%) (yellow box): AH good quality (qual >28), CH medium to good quality (qual >20).
    • Lower and upper whiskers (10% and 90% points): AH medium to good quality (qual >28), CH poor to good quality (qual >14).
    • Mean quality (blue line): AH good quality (qual >28), CH medium to good quality (qual >20).
  3. Per tile sequence quality: All samples performed sufficiently. No warning.
  4. Per sequence quality score: All samples performed sufficiently, summarised in the figure below.
  5. Per base sequence content: All samples flagged with a warning indicating a difference greater than 10% in any position. However, this is potentially due to targeted capturing.
  6. Per sequence GC content: All samples failed based on modal GC content as calculated from the observed data and used to build a reference distribution. The sum of the deviations from the normal distribution represents more than 30% of the reads. However, the sharp peaks are most likely due to enriched duplicate sequences from targeted capturing and do not necessarily indicate poor quality.
  7. Sequence Length Distribution: AH reads were all 150 bp, while CH reads ranged from 35 to 151 bp.
  8. Sequence Duplication Levels: Duplicate reads accounted for 96.01%-96.55% of AH reads and 65.44%-67.13% of CH reads.
  9. Adapter Content: Detailed below.
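
The quality cut-offs quoted in the list above (qual >28 and qual >20) are Phred scores, which relate to the base-calling error probability by Q = -10 log10(P). The sketch below shows what those cut-offs mean as error rates; the function names and the three-way labelling are assumptions made for illustration, following the wording used in this report.

```python
def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def classify_quality(q: float) -> str:
    """Label a Phred score using the cut-offs quoted in this report (assumed labels)."""
    if q > 28:
        return "good"
    if q > 20:
        return "medium"
    return "poor"


# Error probabilities at the thresholds mentioned in the quality assessment.
for q in (28, 20, 14):
    print(f"Q{q}: error probability {phred_to_error_prob(q):.4f}, class {classify_quality(q)}")
```

A score of 20 corresponds to one expected error in 100 base calls, and each further 10 points reduces the error probability tenfold.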

4.2 Figures

Per base sequence quality score: [Top] AH [Bottom] CH. AH outperformed CH for both reads. Central red line shows the median value. Inter-quartile range 25-75% (yellow box). Upper and lower whiskers represent the 10% and 90% points. Mean quality (blue line).

5 Alignment Data

Fastq files were trimmed using TrimGalore with the use of cutadapt.

Reads were aligned to GRCh37 using BWA MEM and converted to bam format with samtools.
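
The trimming and alignment steps can be sketched as command lines assembled in Python. This is an illustration under stated assumptions rather than the exact commands used for this report: all file names are hypothetical, no extra tool options are implied, and the sorting step is assumed from the "sorted reads" note in the flagstat summary. The code only builds the commands; it does not run them.

```python
import shlex
from typing import List


def trim_command(r1: str, r2: str, out_dir: str) -> List[str]:
    """Build a paired-end Trim Galore call (Trim Galore invokes cutadapt internally)."""
    return ["trim_galore", "--paired", "--output_dir", out_dir, r1, r2]


def align_pipeline(reference: str, trimmed_r1: str, trimmed_r2: str, out_bam: str) -> str:
    """Build a shell pipeline: BWA MEM alignment piped into a sorted BAM via samtools."""
    bwa = ["bwa", "mem", reference, trimmed_r1, trimmed_r2]
    # "-" tells samtools sort to read the SAM stream from standard input.
    sort = ["samtools", "sort", "-O", "bam", "-o", out_bam, "-"]
    return f"{shlex.join(bwa)} | {shlex.join(sort)}"


# Hypothetical file names, used only to show the shape of the commands.
print(shlex.join(trim_command("sample_R1.fastq.gz", "sample_R2.fastq.gz", "trimmed")))
print(align_pipeline("GRCh37.fa", "trimmed_R1.fq.gz", "trimmed_R2.fq.gz", "sample.sorted.bam"))
```

Building the commands as lists and quoting them with `shlex.join` keeps paths with unusual characters safe if the strings are later passed to a shell.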

The alignment data was assessed using:

Qualimap full HTML report links: Sample AH, Sample CH.

5.1 Alignment Summary

  1. Samtools flagstat mapping summary shows alignment performance with GRCh37 for sorted reads, detailed in the table below.
  2. Mapping quality histogram indicates that AH performed better than CH.
  3. Genome coverage histogram shows that AH produced a normal distribution of coverage depths while CH had an enrichment for some genomic regions.
  4. The duplication rate histograms are shown below.
  5. Genome coverage across GRCh37 shows a uniform distribution of reads for AH [Top], while CH [Bottom] has high depth in some regions with lower coverage in others.

5.2 Tables and Figures

Samtools flagstat mapping summary. Alignment with GRCh37, sorted reads. Percentages are given as QC-passed : QC-failed.

| Metric | CH | AH |
|---|---|---|
| In total (QC-passed reads + QC-failed reads) | 2011262 | 1999498 |
| Secondary | 15710 | 612 |
| Supplementary | 0 | 0 |
| Duplicates | 0 | 0 |
| Mapped | 2006501 (99.76% : N/A) | 1997929 (99.92% : N/A) |
| Paired in sequencing | 1995552 | 1998886 |
| Read1 | 997776 | 999443 |
| Read2 | 997776 | 999443 |
| Properly paired | 1968314 (98.64% : N/A) | 1992738 (99.69% : N/A) |
| With itself and mate mapped | 1986886 | 1996840 |
| Singletons | 3905 (0.20% : N/A) | 477 (0.02% : N/A) |
| With mate mapped to a different chr | 14488 | 1612 |
| With mate mapped to a different chr (mapQ>=5) | 8748 | 1426 |
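
The percentages in the flagstat summary can be recomputed from the raw counts as a sanity check. In this sketch the counts are copied from the summary above; the dictionary layout and helper name are illustrative. It follows the samtools flagstat convention that the mapped percentage is relative to the total read count, while the properly paired and singleton percentages are relative to reads paired in sequencing.

```python
COUNTS = {
    "CH": {"total": 2011262, "mapped": 2006501, "paired": 1995552,
           "properly_paired": 1968314, "singletons": 3905},
    "AH": {"total": 1999498, "mapped": 1997929, "paired": 1998886,
           "properly_paired": 1992738, "singletons": 477},
}


def pct(numerator: int, denominator: int) -> float:
    """Percentage rounded to two decimals, as reported in the summary."""
    return round(100 * numerator / denominator, 2)


def summarise(counts: dict) -> dict:
    """Recompute the three flagstat percentages for one sample."""
    return {
        "mapped_pct": pct(counts["mapped"], counts["total"]),
        "properly_paired_pct": pct(counts["properly_paired"], counts["paired"]),
        "singletons_pct": pct(counts["singletons"], counts["paired"]),
    }


for sample, counts in COUNTS.items():
    print(sample, summarise(counts))
```

The recomputed values match the reported figures for both samples.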

5.3 Mapping Quality Histogram

5.4 Genome Coverage Histogram

5.5 Duplication Rate Histogram

5.6 Genome Coverage Across Reference

6 About

This document’s source code is available from the GitHub repository.

All code used in this report is available on the GitHub repository.